A Comparison of Blocking Methods for Record Linkage

Author: A. Goldenberg
D. Vatsalan
H. Liang
L. Paulevé
M. Kuzu
P. Christen
R. Hall
S. Fortunato
T. Herzog
Publication date: 01/01/2014

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues. Comment: 22 pages, 2 tables, 7 figures
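To illustrate the two evaluation measures named in this abstract, the following is a minimal sketch of traditional blocking on a field-derived key, with reduction ratio and recall computed on toy data. The records, the blocking keys, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Partition records by a blocking key; return candidate pairs within blocks."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        # Only records sharing a block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def reduction_ratio(n_records, candidate_pairs):
    """Fraction of all possible pairs that blocking avoids comparing."""
    total = n_records * (n_records - 1) // 2
    return 1 - len(candidate_pairs) / total

def recall(candidate_pairs, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidate_pairs & true_matches) / len(true_matches)

# Hypothetical toy data: records 1/2 and 3/4 refer to the same people.
records = {
    1: {"last": "Smith", "zip": "15213"},
    2: {"last": "Smyth", "zip": "15213"},
    3: {"last": "Jones", "zip": "15217"},
    4: {"last": "Jones", "zip": "15217"},
}
true_matches = {(1, 2), (3, 4)}

loose = block_pairs(records, lambda r: r["last"][0] + r["zip"])
strict = block_pairs(records, lambda r: r["last"])

print(round(reduction_ratio(len(records), loose), 3), recall(loose, true_matches))    # 0.667 1.0
print(round(reduction_ratio(len(records), strict), 3), recall(strict, true_matches))  # 0.833 0.5
```

The stricter key compares fewer pairs but loses the misspelled match, which is the trade-off between reduction ratio and recall that the comparison is about.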

arXiv.org e-Print Archive

Crossref

Estimating parameters for probabilistic linkage of privacy-preserved datasets.

Author: A Ferrante
A Wajda
Adrian P. Brown
Anna M. Ferrante
D Vatsalan
GP Basharin
I Fellegi
James B. Semmens
James H. Boyd
JH Boyd
MA Jaro
R Schnell
S Randall
Sean M. Randall
SL DuVall
SM Randall
TC Ong
WE Winkler
Y Thibaudeau
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Methods: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Results: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. Conclusions: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
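As a rough illustration of what "separate Bloom filters for each field" can look like, the sketch below encodes the character bigrams of one field value into a Bloom filter and compares two filters with the Dice coefficient. The filter length `m`, the number of hash functions `k`, the keyed double-hashing scheme, and the `secret` value are all assumptions chosen for the example; the abstract does not specify the paper's parameters, and the EM-based estimation step is not shown.

```python
import hashlib

def bigrams(value):
    """Character bigrams of a padded, lower-cased field value."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(value, m=1024, k=4, secret="demo-key"):
    """Return the set of bit positions set in an m-bit Bloom filter.

    Double hashing (h1 + i * h2) mod m is one common construction;
    it is an assumption here, not the paper's stated scheme.
    """
    bits = set()
    for gram in bigrams(value):
        data = (secret + gram).encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits

def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# One filter per field: each field of a record is encoded on its own.
record_a = {"first": "john", "last": "smith"}
record_b = {"first": "jon", "last": "smyth"}
for field in record_a:
    similarity = dice(bloom_encode(record_a[field]), bloom_encode(record_b[field]))
    print(field, round(similarity, 2))
```

Field-level similarities of this kind are the pairwise comparison outcomes to which match probabilities and thresholds would then be applied.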

Crossref

Directory of Open Access Journals

espace@Curtin

Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets.

Author: A McCallum
Adrian P. Brown
Christian Borgs
CJ Bradley
D Karapiperis
D Rosman
D Vatsalan
DP Jutte
E Durham
EA Durham
EL Brook
F Niedermeyer
G Lawrence
GH Shah
IA Binswanger
J Smith
JH Boyd
JJ Trinckes
JMM Evans
M Kroll
M Kuzu
MA Hernández
MG Maxfield
P Christen
R Schnell
Rainer Schnell
SA McDonald
Sean M. Randall
SM Randall
TG Kristensen
TL Dassanayake
TN Herzog
Z Wan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. The real-world performance of these techniques on large-scale data has so far been unknown. Methods: Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated. Results: Clear-text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time, but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine tuning of parameters. Conclusions: We argue that the increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
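The linkage-quality measures named in this abstract (recall, precision and F-measure) can be computed from a set of predicted links and a gold-standard set of true links. The sketch below uses the standard definitions on hypothetical pair identifiers; the function name and the data are assumptions for illustration, not the study's evaluation code.

```python
def linkage_quality(predicted, truth):
    """Return (precision, recall, F-measure) for predicted vs. gold-standard links."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical record-pair links: two of three predicted links are correct,
# and two of four true links are found.
predicted = {(1, 2), (3, 4), (5, 6)}
truth = {(1, 2), (3, 4), (7, 8), (9, 10)}

precision, recall, f_measure = linkage_quality(predicted, truth)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.667 0.5 0.571
```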

Crossref

Directory of Open Access Journals

espace@Curtin

Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation

Author: A Beimel
A Geissbuhler
AF Karr
AL Potosky
Antonis Michalas
AS Lunde
B Pinkas
BA Malin
BA Stewart
BH Bloom
C Clifton
C Friedman
C Quantin
D Vatsalan
EA Durham
G Cormode
G Hripcsak
GM Weber
IS Kohane
J Gichoya
J Vaidya
JF Ludvigsson
JH Holmes
JL Warren
Johan Gustav Bellika
JT Finnell
K Emam El
Kassaye Yitbarek Yigzaw
L Fan
L Lenert
LH Curtis
M Kantarcioglu
MA Hailemichael
MA Hernández
MK Ross
O Goldreich
P Christen
P Paillier
P Saint-Andre
R Cramer
R Lazarus
R Schnell
RL Richesson
S Tarkoma
SC Pohlig
SM Randall
T Dimitriou
W Du
WB Lober
Y Lindell
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. Methods: We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. Results: The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N − 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. Conclusions: The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.

Crossref

WestminsterResearch

Efficient protocols for private record linkage

Author: Goldreich O.
Huang Y.
Kuzu M.
Ravikumar P.
Vatsalan D.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014

Record linkage allows data from different sources to be integrated to facilitate data mining tasks. However, in many cases, records have to be linked by personally identifiable information. To prevent privacy breaches, records should ideally be linked in a private way such that no information other than the matching result is leaked in the process. In this paper, we present an exact Private Record Linkage (PRL) protocol and an approximate PRL protocol. The exact PRL protocol is based on Oblivious Bloom Intersection, which is an efficient private set intersection protocol. The approximate PRL protocol extends the exact PRL protocol by incorporating Locality Sensitive Hash functions. Both protocols are secure in the semi-honest model. We also report the evaluation results based on our C implementation of the protocols. The results show that our protocols are efficient and effective.

Crossref

University of Strathclyde Institutional Repository

Efficient private multi-party numerical records matching

Author: D Vatsalan
G L Xiang
S Trepetin
T Churches
V S Verykios
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Private Blocking Technique for Multi-party Privacy-Preserving Record Linkage

Author: A Karakasidis
D Vatsalan
L Sweeney
Publication venue: 'Springer Science and Business Media LLC'

Crossref